A total of 2260 unique songs were obtained from the playlists “Today’s Top Hits”, ‘Pop Rising’, ‘Hot Rhythmic’, ‘Mega Hit Mix’, ‘New Music Friday’, ‘Hit Rewind’, ‘Teen Party’, ‘Guilty Pleasures’, ‘Women of Pop’, ‘Soft Pop Hits’, ‘African Heat’, ‘Acoustic Hits’, ‘Fresh & Chill’, ‘Bedroom Pop’, ‘Everyday Favorites’, ‘Global X’, ‘Contemporary Blend’, ‘Fangirls Run the World’, ‘Singled Out’, ‘Left of Center’, ‘Afropop’, ‘Pop Sauce’, ‘Mellow Pop’, ‘Wa-oh-wa-oh!’, ‘Out Now’, ‘Pop Royalty’, ‘Workday: Pop’, ‘Now Hear This’, ‘Certified Gold’, ‘Crowd Pleasers’, ‘LA Pops’, ‘LADY GAGA / JOANNE’, ‘Pop Matters’, ‘Retro Pop’, “Tomorrow’s Hits”, ‘All A Cappella’, ‘Yalla Araby’, ‘Persian Essentials’, ‘Radio 1 Playlist (BBC)’, ‘Wild Cards: Winter Mix’, ‘Arab X’, ‘Fresh Finds: Poptronix’, ‘The GRAMMYs Official Playlist’, ‘Pop Chile’, “Today’s Top Egyptian Hits”, and “Today’s Top Maghreb Hits”.
## liveness tempo energy speechiness mode instrumentalness
## 1 0.2760 120.966 0.768 0.0360 1 4.49e-05
## 2 0.1150 103.968 0.767 0.1860 0 0.00e+00
## 3 0.1370 113.981 0.780 0.0623 0 7.07e-06
## 4 0.0398 99.974 0.678 0.0514 0 2.12e-05
## 5 0.0505 139.943 0.545 0.0625 1 2.89e-04
## 6 0.1570 119.953 0.890 0.0405 0 0.00e+00
## name popularity acousticness loudness
## 1 Dance In The Dark 45 2.99e-05 -6.211
## 2 Yamen Yasar 23 4.30e-01 -2.073
## 3 Creatures 47 2.17e-01 -4.313
## 4 Mafeesh Menha 27 2.50e-01 -7.162
## 5 Boys 72 6.46e-02 -5.192
## 6 Why Are We So Broken (feat. blink-182) 37 2.63e-03 -5.016
## valence danceability
## 1 0.0986 0.645
## 2 0.8900 0.784
## 3 0.5500 0.745
## 4 0.8410 0.820
## 5 0.5250 0.867
## 6 0.3820 0.587
## [1] 20.64241
Popularity in the data set has a standard deviation of 20.64, so the songs cover a wide range of popularity values. The distribution leans slightly toward higher popularity values.
The data set is split into a training set (2000 songs, as the degrees of freedom in the model summaries below imply) and a test set.
To investigate which factors influence a song’s popularity, we first fit a naive multiple linear regression (MLR) model, where the regressors are energy, valence, liveness, tempo, speechiness, instrumentalness, acousticness, loudness, and danceability.
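A random split of this kind can be sketched in base R as follows. This is a minimal illustration, not the split actually used: the data frame `songs` is simulated, the seed is arbitrary, and the training size of 2000 is taken from the degrees of freedom in the model output, with the assumption that all remaining songs form the test set.

```r
set.seed(2018)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical stand-in for the song data: 2260 rows, two illustrative columns.
songs <- data.frame(
  popularity   = round(runif(2260, min = 0, max = 100)),
  danceability = runif(2260)
)

# Draw 2000 row indices without replacement for the training set.
n_train   <- 2000
train_idx <- sample(seq_len(nrow(songs)), size = n_train)

train <- songs[train_idx, ]   # rows used to fit the models
test  <- songs[-train_idx, ]  # held-out rows used to assess predictions

c(train = nrow(train), test = nrow(test))
```

Sampling indices once and using negative indexing for the complement guarantees that no song appears in both sets.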
##
## Call:
## lm(formula = (popularity) ~ energy + valence + liveness + tempo +
## speechiness + instrumentalness + acousticness + loudness +
## danceability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.795 -13.490 1.701 14.663 49.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.496e+01 5.121e+00 14.637 < 2e-16 ***
## energy -1.790e+01 4.207e+00 -4.254 2.19e-05 ***
## valence -1.646e+01 2.374e+00 -6.933 5.53e-12 ***
## liveness -8.647e+00 3.312e+00 -2.611 0.00911 **
## tempo -2.375e-04 1.629e-02 -0.015 0.98837
## speechiness 2.112e+00 5.426e+00 0.389 0.69718
## instrumentalness -1.631e+00 3.987e+00 -0.409 0.68257
## acousticness -8.506e+00 2.164e+00 -3.931 8.76e-05 ***
## loudness 1.878e+00 2.516e-01 7.463 1.26e-13 ***
## danceability 2.138e+01 3.539e+00 6.042 1.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.71 on 1990 degrees of freedom
## Multiple R-squared: 0.09236, Adjusted R-squared: 0.08825
## F-statistic: 22.5 on 9 and 1990 DF, p-value: < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(naive_mlr)/sigma(naive_mlr)
## D = 0.03722, p-value = 0.007843
## alternative hypothesis: two-sided
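The Kolmogorov-Smirnov output above compares the standardized residuals, `resid(naive_mlr)/sigma(naive_mlr)`, with a standard normal distribution. With a p-value of 0.0078, normality of the residuals is rejected at the 5% level. A minimal sketch of the same check in base R, on simulated data (the variables `x`, `y`, and the model `fit` are illustrative and are not the song data):

```r
set.seed(1)

# Simulated regression data with normal errors (illustrative only).
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)

# Standardize the residuals by the residual standard error,
# then test them against the standard normal CDF.
std_resid <- resid(fit) / sigma(fit)
ks <- ks.test(std_resid, "pnorm")

print(ks)
```

A small p-value indicates that the standardized residuals depart from the standard normal distribution; a large one means the test finds no evidence against normality.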
The estimates indicate a negative relationship between popularity and energy, valence, liveness, tempo, instrumentalness, and acousticness. Most of these results make sense. For example, people generally prefer to play music recorded in studios, so songs with low liveness are more popular. Popular music in 2018 is increasingly electronic, so a negative coefficient for acousticness is plausible. Catchy music also tends to have lyrics, which corresponds to low instrumentalness.
However, it is unclear why energy, valence, and tempo also have negative coefficients, since pop music is generally assumed to be energetic and happy. The tempo coefficient, at least, is not distinguishable from zero (p = 0.988).
The standard errors are small relative to the estimates for most coefficients, so most estimates are highly significant. The exceptions are tempo, speechiness, and instrumentalness.
The \(R^2\) value is small (0.092), which means that the model explains only about 9% of the variability in popularity.
The F-statistic, however, is highly significant (p < 2.2e-16), so the regressors jointly explain significantly more variation than an intercept-only model.
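As a check on the summary output, the overall F-statistic can be recovered from \(R^2\). With \(p = 9\) regressors and \(n - p - 1 = 1990\) residual degrees of freedom:

\[
F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)} = \frac{0.09236 / 9}{0.90764 / 1990} \approx 22.5
\]

This matches the reported value. A highly significant F-statistic and a small \(R^2\) are therefore compatible: with roughly 2000 observations, even a weak joint effect of the regressors is detectable.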
Next, we will focus on improving the \(R^2\) value.
##
## Call:
## lm(formula = (popularity^1.3) ~ energy + valence + liveness +
## tempo + speechiness + instrumentalness + acousticness + loudness +
## danceability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -229.070 -59.354 2.293 60.016 217.020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.798236 21.131204 12.768 < 2e-16 ***
## energy -72.792865 17.360787 -4.193 2.87e-05 ***
## valence -69.089088 9.797541 -7.052 2.43e-12 ***
## liveness -33.663446 13.667177 -2.463 0.013859 *
## tempo 0.006854 0.067225 0.102 0.918797
## speechiness 6.786891 22.389726 0.303 0.761826
## instrumentalness -10.368091 16.449525 -0.630 0.528572
## acousticness -34.281698 8.928764 -3.839 0.000127 ***
## loudness 7.823709 1.038207 7.536 7.33e-14 ***
## danceability 88.193384 14.603822 6.039 1.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.33 on 1990 degrees of freedom
## Multiple R-squared: 0.093, Adjusted R-squared: 0.0889
## F-statistic: 22.67 on 9 and 1990 DF, p-value: < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(trans_mlr)/sigma(trans_mlr)
## D = 0.024286, p-value = 0.1888
## alternative hypothesis: two-sided
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'optimize' will be disregarded
##
## Call:
## lm(formula = (popularity^lambda - 1)/lambda ~ energy + valence +
## liveness + tempo + speechiness + instrumentalness + acousticness +
## loudness + danceability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176.208 -45.657 1.764 46.166 166.939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 206.767874 16.254772 12.720 < 2e-16 ***
## energy -55.994512 13.354452 -4.193 2.87e-05 ***
## valence -53.145452 7.536570 -7.052 2.43e-12 ***
## liveness -25.894959 10.513213 -2.463 0.013859 *
## tempo 0.005273 0.051711 0.102 0.918797
## speechiness 5.220685 17.222866 0.303 0.761826
## instrumentalness -7.975455 12.653481 -0.630 0.528572
## acousticness -26.370537 6.868280 -3.839 0.000127 ***
## loudness 6.018238 0.798621 7.536 7.33e-14 ***
## danceability 67.841064 11.233709 6.039 1.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.56 on 1990 degrees of freedom
## Multiple R-squared: 0.093, Adjusted R-squared: 0.0889
## F-statistic: 22.67 on 9 and 1990 DF, p-value: < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(bc_trans_mlr)/sigma(bc_trans_mlr)
## D = 0.024286, p-value = 0.1888
## alternative hypothesis: two-sided
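The two transformed models above report identical t values, \(R^2\), F-statistic, and Kolmogorov-Smirnov results. This is expected if \(\lambda = 1.3\), which the ratio of the coefficients suggests, because the Box-Cox response is then an affine function of \(\text{popularity}^{1.3}\). Writing \(\hat\beta_j\) for the coefficients of the model fitted to \(y^{\lambda}\):

\[
y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}, \qquad
\hat\beta_j^{(\lambda)} = \frac{\hat\beta_j}{\lambda} \ (j \ge 1), \qquad
\hat\beta_0^{(\lambda)} = \frac{\hat\beta_0 - 1}{\lambda}
\]

For example, the energy coefficient is \(-72.792865 / 1.3 \approx -55.9945\) and the intercept is \((269.798236 - 1) / 1.3 \approx 206.7679\), in agreement with the output. The residual standard error scales the same way (\(81.33 / 1.3 \approx 62.56\)). Compared with the naive model, the Kolmogorov-Smirnov p-value rises from 0.0078 to 0.1888, so normality of the standardized residuals is no longer rejected at the 5% level.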
Let’s first visualize the distributions of the regressors speechiness, tempo, and instrumentalness.
Below are histograms of speechiness, danceability, energy, liveness, valence, and instrumentalness.
We can see that speechiness, instrumentalness, acousticness, and liveness are heavily concentrated near 0.
The pair plot confirms our observation that some regressors do not follow a normal distribution. There is also little correlation between the regressors, except between energy and loudness.
From the plots above, we can see that a few regressors, such as liveness and speechiness, are not normally distributed. A log transformation is applied to both variables.
##
## Call:
## lm(formula = (popularity^1.3 - 1)/1.3 ~ energy + valence + log(liveness) +
## tempo + log(speechiness) + instrumentalness + acousticness +
## loudness + danceability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.54 -46.08 1.89 45.94 164.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.136877 19.038247 10.197 < 2e-16 ***
## energy -56.221713 13.390758 -4.199 2.80e-05 ***
## valence -53.481201 7.530899 -7.102 1.71e-12 ***
## log(liveness) -6.251298 2.318216 -2.697 0.007064 **
## tempo 0.003513 0.051729 0.068 0.945864
## log(speechiness) 0.892653 2.098486 0.425 0.670606
## instrumentalness -8.286410 12.664369 -0.654 0.512988
## acousticness -26.011240 6.869857 -3.786 0.000157 ***
## loudness 6.040478 0.798016 7.569 5.70e-14 ***
## danceability 66.800868 11.327930 5.897 4.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.54 on 1990 degrees of freedom
## Multiple R-squared: 0.09358, Adjusted R-squared: 0.08948
## F-statistic: 22.83 on 9 and 1990 DF, p-value: < 2.2e-16
As we can see, \(R^2\) improved slightly (from 0.0930 to 0.0936) and the model remains significant.
Now we add a categorical variable, mode, to see if \(R^2\) can be further improved. Mode encodes a major scale as 1 and a minor scale as 0. As shown below, \(R^2\) rises from 0.09236 in the naive MLR to 0.09884, a relative increase of roughly 7%.
## F
## 0 1
## 818 1182
##
## Call:
## lm(formula = popularity ~ energy + valence + log(liveness) +
## tempo + log(speechiness) + instrumentalness + acousticness +
## loudness + danceability + mode)
##
## Residuals:
## Min 1Q Median 3Q Max
## -178.099 -44.685 1.022 45.429 167.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 187.2958 19.0936 9.809 < 2e-16 ***
## energy -54.0742 13.3701 -4.044 5.45e-05 ***
## valence -53.0072 7.5122 -7.056 2.35e-12 ***
## log(liveness) -5.8902 2.3145 -2.545 0.01101 *
## tempo 0.0013 0.0516 0.025 0.97990
## log(speechiness) 1.4812 2.1000 0.705 0.48069
## instrumentalness -8.3387 12.6308 -0.660 0.50921
## acousticness -26.4130 6.8526 -3.854 0.00012 ***
## loudness 5.8750 0.7974 7.368 2.53e-13 ***
## danceability 68.4358 11.3080 6.052 1.71e-09 ***
## mode 9.7783 2.8701 3.407 0.00067 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.38 on 1989 degrees of freedom
## Multiple R-squared: 0.09884, Adjusted R-squared: 0.0943
## F-statistic: 21.81 on 10 and 1989 DF, p-value: < 2.2e-16
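Holding the other regressors fixed, changing a song from a minor scale to a major scale is associated with a difference in popularity of 3.15. Note that the mode coefficient reported above (9.78) appears to be on the transformed response scale, judging by the range of the residuals.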
## Start: AIC=16543.67
## popularity ~ energy + valence + log(liveness) + tempo + log(speechiness) +
## instrumentalness + acousticness + loudness + danceability +
## mode
##
## Df Sum of Sq RSS AIC
## - tempo 1 2 7738639 16542
## - instrumentalness 1 1696 7740333 16542
## - log(speechiness) 1 1936 7740572 16542
## <none> 7738637 16544
## - log(liveness) 1 25199 7763836 16548
## - mode 1 45161 7783798 16553
## - acousticness 1 57803 7796440 16557
## - energy 1 63642 7802279 16558
## - danceability 1 142502 7881139 16578
## - valence 1 193716 7932353 16591
## - loudness 1 211207 7949844 16596
##
## Step: AIC=16541.67
## popularity ~ energy + valence + log(liveness) + log(speechiness) +
## instrumentalness + acousticness + loudness + danceability +
## mode
##
## Df Sum of Sq RSS AIC
## - instrumentalness 1 1693 7740333 16540
## - log(speechiness) 1 1987 7740627 16540
## <none> 7738639 16542
## - log(liveness) 1 25231 7763870 16546
## - mode 1 45177 7783816 16551
## - acousticness 1 57994 7796634 16555
## - energy 1 63690 7802330 16556
## - danceability 1 148454 7887094 16578
## - valence 1 194046 7932685 16589
## - loudness 1 211206 7949845 16594
##
## Step: AIC=16540.11
## popularity ~ energy + valence + log(liveness) + log(speechiness) +
## acousticness + loudness + danceability + mode
##
## Df Sum of Sq RSS AIC
## - log(speechiness) 1 2334 7742667 16539
## <none> 7740333 16540
## - log(liveness) 1 24943 7765276 16544
## - mode 1 45146 7785479 16550
## - acousticness 1 57856 7798188 16553
## - energy 1 67265 7807597 16555
## - danceability 1 147437 7887770 16576
## - valence 1 193478 7933811 16588
## - loudness 1 233382 7973715 16598
##
## Step: AIC=16538.71
## popularity ~ energy + valence + log(liveness) + acousticness +
## loudness + danceability + mode
##
## Df Sum of Sq RSS AIC
## <none> 7742667 16539
## - log(liveness) 1 24669 7767336 16543
## - mode 1 43776 7786443 16548
## - acousticness 1 58064 7800731 16552
## - energy 1 65381 7808048 16554
## - danceability 1 153980 7896647 16576
## - valence 1 191786 7934452 16586
## - loudness 1 231784 7974451 16596
## Start: AIC=16605.28
## popularity ~ energy + valence + log(liveness) + tempo + log(speechiness) +
## instrumentalness + acousticness + loudness + danceability +
## mode
##
## Df Sum of Sq RSS AIC
## - tempo 1 2 7738639 16598
## - instrumentalness 1 1696 7740333 16598
## - log(speechiness) 1 1936 7740572 16598
## - log(liveness) 1 25199 7763836 16604
## <none> 7738637 16605
## - mode 1 45161 7783798 16609
## - acousticness 1 57803 7796440 16613
## - energy 1 63642 7802279 16614
## - danceability 1 142502 7881139 16634
## - valence 1 193716 7932353 16647
## - loudness 1 211207 7949844 16652
##
## Step: AIC=16597.68
## popularity ~ energy + valence + log(liveness) + log(speechiness) +
## instrumentalness + acousticness + loudness + danceability +
## mode
##
## Df Sum of Sq RSS AIC
## - instrumentalness 1 1693 7740333 16590
## - log(speechiness) 1 1987 7740627 16591
## - log(liveness) 1 25231 7763870 16597
## <none> 7738639 16598
## - mode 1 45177 7783816 16602
## - acousticness 1 57994 7796634 16605
## - energy 1 63690 7802330 16606
## - danceability 1 148454 7887094 16628
## - valence 1 194046 7932685 16640
## - loudness 1 211206 7949845 16644
##
## Step: AIC=16590.51
## popularity ~ energy + valence + log(liveness) + log(speechiness) +
## acousticness + loudness + danceability + mode
##
## Df Sum of Sq RSS AIC
## - log(speechiness) 1 2334 7742667 16584
## - log(liveness) 1 24943 7765276 16589
## <none> 7740333 16590
## - mode 1 45146 7785479 16594
## - acousticness 1 57856 7798188 16598
## - energy 1 67265 7807597 16600
## - danceability 1 147437 7887770 16621
## - valence 1 193478 7933811 16632
## - loudness 1 233382 7973715 16642
##
## Step: AIC=16583.52
## popularity ~ energy + valence + log(liveness) + acousticness +
## loudness + danceability + mode
##
## Df Sum of Sq RSS AIC
## - log(liveness) 1 24669 7767336 16582
## <none> 7742667 16584
## - mode 1 43776 7786443 16587
## - acousticness 1 58064 7800731 16591
## - energy 1 65381 7808048 16593
## - danceability 1 153980 7896647 16615
## - valence 1 191786 7934452 16625
## - loudness 1 231784 7974451 16635
##
## Step: AIC=16582.28
## popularity ~ energy + valence + acousticness + loudness + danceability +
## mode
##
## Df Sum of Sq RSS AIC
## <none> 7767336 16582
## - mode 1 47088 7814424 16587
## - acousticness 1 61459 7828795 16590
## - energy 1 73438 7840774 16594
## - danceability 1 169156 7936491 16618
## - valence 1 191673 7959009 16623
## - loudness 1 234141 8001477 16634
## Start: AIC=16731.8
## popularity ~ 1
##
## Df Sum of Sq RSS AIC
## + loudness 1 235221 8352149 16678
## + acousticness 1 158871 8428500 16696
## + danceability 1 126569 8460801 16704
## + valence 1 105201 8482170 16709
## + mode 1 54394 8532977 16721
## + log(liveness) 1 52608 8534762 16722
## + instrumentalness 1 25303 8562068 16728
## <none> 8587371 16732
## + energy 1 5878 8581493 16732
## + tempo 1 3652 8583719 16733
## + log(speechiness) 1 1561 8585810 16733
##
## Step: AIC=16678.25
## popularity ~ loudness
##
## Df Sum of Sq RSS AIC
## + valence 1 190638 8161512 16634
## + energy 1 136334 8215815 16647
## + danceability 1 85450 8266700 16660
## + log(liveness) 1 62141 8290008 16665
## + mode 1 54696 8297453 16667
## + acousticness 1 31954 8320196 16673
## <none> 8352149 16678
## + tempo 1 7673 8344476 16678
## + instrumentalness 1 4311 8347838 16679
## + log(speechiness) 1 76 8352074 16680
##
## Step: AIC=16634.07
## popularity ~ loudness + valence
##
## Df Sum of Sq RSS AIC
## + danceability 1 251256 7910256 16574
## + log(liveness) 1 64697 8096815 16620
## + energy 1 58269 8103243 16622
## + acousticness 1 43395 8118117 16625
## + mode 1 40971 8120541 16626
## <none> 8161512 16634
## + tempo 1 7674 8153837 16634
## + log(speechiness) 1 3671 8157841 16635
## + instrumentalness 1 2663 8158849 16635
##
## Step: AIC=16573.54
## popularity ~ loudness + valence + danceability
##
## Df Sum of Sq RSS AIC
## + mode 1 51678 7858578 16562
## + log(liveness) 1 37101 7873155 16566
## + energy 1 35996 7874260 16566
## + acousticness 1 14796 7895460 16572
## <none> 7910256 16574
## + instrumentalness 1 3833 7906423 16575
## + tempo 1 51 7910205 16576
## + log(speechiness) 1 17 7910239 16576
##
## Step: AIC=16562.43
## popularity ~ loudness + valence + danceability + mode
##
## Df Sum of Sq RSS AIC
## + log(liveness) 1 32667 7825912 16556
## + energy 1 29783 7828795 16557
## + acousticness 1 17805 7840774 16560
## <none> 7858578 16562
## + instrumentalness 1 3821 7854757 16564
## + log(speechiness) 1 634 7857944 16564
## + tempo 1 65 7858513 16564
##
## Step: AIC=16556.1
## popularity ~ loudness + valence + danceability + mode + log(liveness)
##
## Df Sum of Sq RSS AIC
## + energy 1 25180.8 7800731 16552
## + acousticness 1 17864.3 7808048 16554
## <none> 7825912 16556
## + instrumentalness 1 4145.1 7821767 16557
## + log(speechiness) 1 915.9 7824996 16558
## + tempo 1 18.9 7825893 16558
##
## Step: AIC=16551.65
## popularity ~ loudness + valence + danceability + mode + log(liveness) +
## energy
##
## Df Sum of Sq RSS AIC
## + acousticness 1 58064 7742667 16539
## <none> 7800731 16552
## + log(speechiness) 1 2543 7798188 16553
## + instrumentalness 1 1904 7798827 16553
## + tempo 1 346 7800385 16554
##
## Step: AIC=16538.71
## popularity ~ loudness + valence + danceability + mode + log(liveness) +
## energy + acousticness
##
## Df Sum of Sq RSS AIC
## <none> 7742667 16539
## + log(speechiness) 1 2334.22 7740333 16540
## + instrumentalness 1 2040.26 7740627 16540
## + tempo 1 36.22 7742631 16541
## Start: AIC=16737.4
## popularity ~ 1
##
## Df Sum of Sq RSS AIC
## + loudness 1 235221 8352149 16690
## + acousticness 1 158871 8428500 16708
## + danceability 1 126569 8460801 16715
## + valence 1 105201 8482170 16720
## + mode 1 54394 8532977 16732
## + log(liveness) 1 52608 8534762 16733
## <none> 8587371 16737
## + instrumentalness 1 25303 8562068 16739
## + energy 1 5878 8581493 16744
## + tempo 1 3652 8583719 16744
## + log(speechiness) 1 1561 8585810 16745
##
## Step: AIC=16689.46
## popularity ~ loudness
##
## Df Sum of Sq RSS AIC
## + valence 1 190638 8161512 16651
## + energy 1 136334 8215815 16664
## + danceability 1 85450 8266700 16676
## + log(liveness) 1 62141 8290008 16682
## + mode 1 54696 8297453 16684
## + acousticness 1 31954 8320196 16689
## <none> 8352149 16690
## + tempo 1 7673 8344476 16695
## + instrumentalness 1 4311 8347838 16696
## + log(speechiness) 1 76 8352074 16697
##
## Step: AIC=16650.88
## popularity ~ loudness + valence
##
## Df Sum of Sq RSS AIC
## + danceability 1 251256 7910256 16596
## + log(liveness) 1 64697 8096815 16643
## + energy 1 58269 8103243 16644
## + acousticness 1 43395 8118117 16648
## + mode 1 40971 8120541 16648
## <none> 8161512 16651
## + tempo 1 7674 8153837 16657
## + log(speechiness) 1 3671 8157841 16658
## + instrumentalness 1 2663 8158849 16658
##
## Step: AIC=16595.94
## popularity ~ loudness + valence + danceability
##
## Df Sum of Sq RSS AIC
## + mode 1 51678 7858578 16590
## + log(liveness) 1 37101 7873155 16594
## + energy 1 35996 7874260 16594
## <none> 7910256 16596
## + acousticness 1 14796 7895460 16600
## + instrumentalness 1 3833 7906423 16603
## + tempo 1 51 7910205 16604
## + log(speechiness) 1 17 7910239 16604
##
## Step: AIC=16590.43
## popularity ~ loudness + valence + danceability + mode
##
## Df Sum of Sq RSS AIC
## + log(liveness) 1 32667 7825912 16590
## <none> 7858578 16590
## + energy 1 29783 7828795 16590
## + acousticness 1 17805 7840774 16594
## + instrumentalness 1 3821 7854757 16597
## + log(speechiness) 1 634 7857944 16598
## + tempo 1 65 7858513 16598
##
## Step: AIC=16589.7
## popularity ~ loudness + valence + danceability + mode + log(liveness)
##
## Df Sum of Sq RSS AIC
## <none> 7825912 16590
## + energy 1 25180.8 7800731 16591
## + acousticness 1 17864.3 7808048 16593
## + instrumentalness 1 4145.1 7821767 16596
## + log(speechiness) 1 915.9 7824996 16597
## + tempo 1 18.9 7825893 16597
##
## Call:
## lm(formula = popularity ~ energy + valence + log(liveness) +
## acousticness + loudness + danceability + mode)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.165 -44.846 1.626 45.435 168.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 183.4591 15.4576 11.869 < 2e-16 ***
## energy -54.0635 13.1820 -4.101 4.27e-05 ***
## valence -52.6091 7.4895 -7.024 2.94e-12 ***
## log(liveness) -5.8232 2.3115 -2.519 0.011837 *
## acousticness -26.4360 6.8398 -3.865 0.000115 ***
## loudness 5.9733 0.7735 7.722 1.80e-14 ***
## danceability 69.0998 10.9786 6.294 3.79e-10 ***
## mode 9.5942 2.8589 3.356 0.000806 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.34 on 1992 degrees of freedom
## Multiple R-squared: 0.09837, Adjusted R-squared: 0.0952
## F-statistic: 31.05 on 7 and 1992 DF, p-value: < 2.2e-16
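Backward AIC: popularity ~ energy + valence + log(liveness) + acousticness + loudness + danceability + mode
Backward BIC: popularity ~ energy + valence + acousticness + loudness + danceability + mode
Forward AIC: popularity ~ loudness + valence + danceability + mode + log(liveness) + energy + acousticness
Forward BIC: popularity ~ loudness + valence + danceability + mode + log(liveness)
Backward and forward AIC select the same seven regressors, which is the model summarized above. BIC penalizes each parameter more heavily and selects smaller models: the backward search additionally drops log(liveness), and the forward search stops before adding energy and acousticness.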
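The stepwise searches above differ only in the direction of the search and in the penalty per parameter. In base R, `step()` uses the AIC penalty `k = 2` by default, and setting `k = log(n)` gives BIC. The sketch below runs all four searches on simulated data; the variables `x1`, `x2`, `x3`, and `y` are illustrative stand-ins rather than the song features, and `x3` is constructed to have no effect on `y`.

```r
set.seed(1)

# Simulated data: y depends on x1 and x2 only; x3 is pure noise.
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)  # starting point for backward search
null <- lm(y ~ 1, data = d)             # starting point for forward search

# Backward elimination: AIC (k = 2, the default) and BIC (k = log(n)).
back_aic <- step(full, direction = "backward", trace = 0)
back_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Forward selection: the scope formula gives the largest model allowed.
fwd_aic <- step(null, scope = formula(full), direction = "forward", trace = 0)
fwd_bic <- step(null, scope = formula(full), direction = "forward",
                k = log(n), trace = 0)

# Compare the selected models.
lapply(list(back_aic = back_aic, back_bic = back_bic,
            fwd_aic = fwd_aic, fwd_bic = fwd_bic), formula)
```

Because `log(n)` exceeds 2 whenever `n` is at least 8, the BIC searches require a larger reduction in the residual sum of squares to keep or add a term, which is why they tend to return smaller models.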
##
## Call:
## glm(formula = popularity ~ energy + valence + liveness + tempo +
## speechiness + instrumentalness + acousticness + loudness +
## danceability + mode, family = gaussian(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -189.373 -44.938 0.473 46.311 167.690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.400e+00 1.248e-01 43.255 < 2e-16 ***
## energy -4.008e-01 9.879e-02 -4.057 5.16e-05 ***
## valence -3.704e-01 5.534e-02 -6.693 2.84e-11 ***
## liveness -1.617e-01 8.199e-02 -1.973 0.048661 *
## tempo -5.595e-06 3.868e-04 -0.014 0.988462
## speechiness 1.063e-01 1.275e-01 0.834 0.404302
## instrumentalness -1.056e-01 1.086e-01 -0.972 0.331007
## acousticness -1.843e-01 5.204e-02 -3.542 0.000407 ***
## loudness 4.687e-02 6.353e-03 7.378 2.34e-13 ***
## danceability 4.675e-01 8.294e-02 5.636 1.99e-08 ***
## mode 6.084e-02 2.121e-02 2.868 0.004170 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3910.523)
##
## Null deviance: 8587371 on 1999 degrees of freedom
## Residual deviance: 7778048 on 1989 degrees of freedom
## AIC: 22232
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = popularity ~ energy + valence + log(liveness) +
## tempo + log(speechiness) + instrumentalness + acousticness +
## loudness + danceability + mode, family = gaussian(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -188.343 -45.431 0.473 46.323 166.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.361e+00 1.452e-01 36.913 < 2e-16 ***
## energy -4.041e-01 9.899e-02 -4.082 4.64e-05 ***
## valence -3.743e-01 5.532e-02 -6.767 1.72e-11 ***
## log(liveness) -3.847e-02 1.732e-02 -2.221 0.026462 *
## tempo -2.322e-05 3.867e-04 -0.060 0.952130
## log(speechiness) 1.545e-02 1.547e-02 0.999 0.317948
## instrumentalness -1.055e-01 1.086e-01 -0.972 0.331133
## acousticness -1.813e-01 5.203e-02 -3.485 0.000503 ***
## loudness 4.698e-02 6.342e-03 7.408 1.89e-13 ***
## danceability 4.569e-01 8.366e-02 5.462 5.31e-08 ***
## mode 6.104e-02 2.123e-02 2.875 0.004079 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3908.22)
##
## Null deviance: 8587371 on 1999 degrees of freedom
## Residual deviance: 7773467 on 1989 degrees of freedom
## AIC: 22230
##
## Number of Fisher Scoring iterations: 6
## [1] 1.19022
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 149.707318
## energy -13.507150
## valence -30.854508
## -4.793054
## acousticness -15.551395
## loudness 2.791660
## danceability 38.304137
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 194.744549
## liveness -24.273827
## energy -50.286777
## mode 9.548689
## acousticness -25.950814
## loudness 5.707692
## valence -51.541311
## danceability 68.391426
## RMSE Rsquare
## 1 65.37451 0.07692129
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 197.573271
## liveness -24.036395
## energy -53.082645
## mode 9.582212
## acousticness -26.461968
## loudness 5.902543
## valence -52.237991
## danceability 69.453568
## RMSE Rsquare
## 1 65.37211 0.07732247
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 197.562998
## liveness -24.148389
## energy -53.099719
## mode 9.609433
## acousticness -26.529768
## loudness 5.896587
## valence -52.218493
## danceability 69.441182
## RMSE Rsquare
## 1 65.37011 0.07737202
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 10
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge 59.58791 61.13682 62.64813 62.42717 63.44247 65.00980 0
## lasso 59.64059 61.21502 62.62200 62.45917 63.44058 65.08245 0
## elastic 60.10650 61.17273 62.36886 62.46749 63.92159 65.29871 0
Ridge regression yields the lowest mean resampled RMSE (62.43), although the three methods are nearly indistinguishable. Its coefficients are:
(Intercept) 70.587562856\
liveness -8.166892612\
tempo -0.001324982\
energy -15.921887134\
speechiness 3.058979282\
mode 3.114763830\
instrumentalness -1.940615809\
acousticness -8.374153049\
loudness 1.745995798\
valence -16.069317150\
danceability 21.587672148
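To illustrate what the ridge penalty does to the coefficients, here is a minimal closed-form sketch in base R. It is not the fitting procedure used above (the sparse-matrix output and resampling summary suggest a package-based fit), and penalty scaling conventions differ between implementations, so the value of `lambda` here is not comparable to a tuned value from another package. The function `ridge_coef` and the simulated variables are hypothetical names introduced for this example.

```r
set.seed(1)

# Simulated data; x1..x3 and y are illustrative stand-ins, not the song features.
n <- 200
X <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)

# Closed-form ridge estimate on standardized predictors and a centered response:
#   beta = (X'X + lambda * I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)       # center and scale each predictor
  yc <- y - mean(y)    # center the response so no intercept is needed
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
  drop(beta)
}

b_ols   <- ridge_coef(X, y, lambda = 0)   # no penalty: ordinary least squares
b_ridge <- ridge_coef(X, y, lambda = 50)  # penalized: coefficients shrink toward 0

rbind(ols = b_ols, ridge = b_ridge)
```

With `lambda = 0` the estimate coincides with ordinary least squares on the standardized predictors. Increasing `lambda` shrinks the coefficient vector toward zero, trading a little bias for lower variance.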